Detailed information about how the demographic data was wrangled
1977-1979
Now let’s take a look at our demographic data across the decades that we wish to study. If you have very wide data (meaning it has many columns), one way to view the data so that you can see all of the columns at the same time is to use the glimpse() function of the dplyr package.
Taking a look at the first decade of data, we can see that the Race/Sex Indicator column contains two pieces of information: the race and the sex. This does not follow the tidy data philosophy, where each cell of a tibble should contain only one piece of information. Typically one might think of using the separate() function of the tidyr package to split this variable into two. However, one of the race values is Other races, which itself contains a space, so separating this data is trickier.
Instead we will use the str_extract() function of the stringr package and the mutate() function of the dplyr package. The mutate() function allows us to create new variables, and str_extract() allows us to match specific patterns and pull out the matches. For example, if the Race/Sex Indicator value is Other races male and we extract patterns matching either "male" or "female" (which we can specify as pattern = "male|female"), then the extracted value will be male.
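To see how this pattern behaves, here is a small illustration (the example values are made up, but mirror the Race/Sex Indicator column):

```r
library(stringr)

# Toy values mirroring the Race/Sex Indicator column (made-up examples)
indicator <- c("White male", "Black female", "Other races male")

# The alternation "male|female" pulls out whichever of the two words appears;
# "female" values are correctly matched as "female", not "male"
str_extract(indicator, pattern = "male|female")
#> [1] "male"   "female" "male"
```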
First we need to rename the Race/Sex Indicator variable so that its name has no spaces, making it easier to use with str_extract().
We also want to rename a couple of variables to be simpler, filter the data to include only the years we are interested in, and remove some variables that we don't need, like the FIPS State Code. We can remove a variable by using the select() function with a minus sign (-) in front of the variable we wish to remove.
Rows: 3,060
Columns: 22
$ `Year of Estimate` <dbl> 1970, 1970, 1970, 1970, 1970, 1970, 1970, 1970, …
$ `FIPS State Code` <chr> "01", "01", "01", "01", "01", "01", "02", "02", …
$ `State Name` <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
$ `Race/Sex Indicator` <chr> "White male", "White female", "Black male", "Bla…
$ `Under 5 years` <dbl> 105856, 100613, 47403, 47079, 244, 250, 12382, 1…
$ `5 to 9 years` <dbl> 120876, 115194, 55443, 54851, 255, 251, 13888, 1…
$ `10 to 14 years` <dbl> 129091, 122352, 60427, 60065, 253, 245, 13255, 1…
$ `15 to 19 years` <dbl> 119500, 116107, 52921, 55144, 281, 254, 11179, 9…
$ `20 to 24 years` <dbl> 103665, 108513, 29948, 35165, 413, 331, 20237, 1…
$ `25 to 29 years` <dbl> 86538, 88359, 19535, 23662, 239, 302, 12538, 107…
$ `30 to 34 years` <dbl> 74452, 77595, 17196, 22021, 236, 284, 10331, 865…
$ `35 to 39 years` <dbl> 71511, 74941, 16654, 22248, 161, 279, 9548, 7510…
$ `40 to 44 years` <dbl> 75242, 78908, 17564, 24249, 127, 253, 8282, 6353…
$ `45 to 49 years` <dbl> 73874, 78589, 18186, 23028, 108, 148, 6995, 5820…
$ `50 to 54 years` <dbl> 68048, 72481, 17618, 22104, 95, 100, 5609, 4494,…
$ `55 to 59 years` <dbl> 61071, 67699, 18118, 21909, 88, 93, 4029, 2986, …
$ `60 to 64 years` <dbl> 52361, 61065, 16456, 20068, 69, 94, 2392, 1830, …
$ `65 to 69 years` <dbl> 38977, 49685, 14498, 19364, 54, 73, 1292, 965, 2…
$ `70 to 74 years` <dbl> 26767, 37227, 9541, 12509, 70, 66, 602, 496, 8, …
$ `75 to 79 years` <dbl> 17504, 27163, 6030, 8291, 31, 52, 326, 305, 1, 5…
$ `80 to 84 years` <dbl> 9937, 16470, 3485, 5031, 37, 30, 211, 186, 4, 5,…
$ `85 years and over` <dbl> 5616, 10445, 2448, 4035, 76, 29, 143, 126, 19, 4…
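The steps just described might be sketched as follows, using a made-up two-row stand-in for the raw data (the intermediate name RACE_SEX is our choice; only the column names come from the file above):

```r
library(dplyr)
library(stringr)

# Toy stand-in for the raw data; column names match the file, values are made up
dem_raw <- tibble(
  `Year of Estimate`   = c(1977, 1977),
  `FIPS State Code`    = c("01", "01"),
  `State Name`         = c("Alabama", "Alabama"),
  `Race/Sex Indicator` = c("White male", "Other races female")
)

dem_77_79 <- dem_raw %>%
  rename(YEAR = `Year of Estimate`,            # simpler names
         STATE = `State Name`,
         RACE_SEX = `Race/Sex Indicator`) %>%  # no spaces in the new name
  filter(YEAR %in% 1977:1979) %>%              # keep only the years of interest
  select(-`FIPS State Code`) %>%               # remove with a minus sign
  mutate(SEX  = str_extract(RACE_SEX, "male|female"),
         RACE = str_extract(RACE_SEX, "White|Black|Other")) %>%
  select(-RACE_SEX)
```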
Rows: 918
Columns: 22
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
$ `Under 5 years` <dbl> 98814, 94595, 46201, 45784, 590, 621, 14316, 1353…
$ `5 to 9 years` <dbl> 113365, 107395, 50097, 49329, 672, 660, 14621, 13…
$ `10 to 14 years` <dbl> 123107, 116182, 54925, 53955, 677, 653, 14795, 13…
$ `15 to 19 years` <dbl> 135343, 130433, 58468, 59926, 674, 605, 15207, 13…
$ `20 to 24 years` <dbl> 126053, 125352, 43898, 51433, 722, 773, 20106, 16…
$ `25 to 29 years` <dbl> 111547, 112471, 31014, 36648, 638, 835, 20444, 18…
$ `30 to 34 years` <dbl> 100674, 101543, 22528, 26694, 571, 766, 17514, 15…
$ `35 to 39 years` <dbl> 81038, 83369, 17473, 22213, 498, 586, 13098, 1069…
$ `40 to 44 years` <dbl> 75042, 77793, 16446, 22146, 356, 479, 10067, 7935…
$ `45 to 49 years` <dbl> 76296, 79753, 16578, 22576, 295, 432, 8460, 6848,…
$ `50 to 54 years` <dbl> 74844, 81079, 17117, 23028, 206, 326, 7268, 5914,…
$ `55 to 59 years` <dbl> 67785, 75905, 16437, 21435, 166, 213, 5398, 4485,…
$ `60 to 64 years` <dbl> 58853, 69406, 16276, 21075, 145, 174, 3349, 2708,…
$ `65 to 69 years` <dbl> 48848, 62430, 15837, 21126, 107, 173, 1714, 1468,…
$ `70 to 74 years` <dbl> 34475, 50075, 11450, 16028, 90, 138, 915, 928, 22…
$ `75 to 79 years` <dbl> 20977, 34027, 7601, 10825, 53, 106, 500, 493, 10,…
$ `80 to 84 years` <dbl> 10831, 21483, 3896, 6272, 25, 49, 237, 268, 4, 7,…
$ `85 years and over` <dbl> 6683, 15729, 2667, 5426, 33, 41, 153, 211, 11, 6,…
$ SEX <chr> "male", "female", "male", "female", "male", "fema…
$ RACE <chr> "White", "White", "Black", "Black", "Other", "Oth…
That’s looking pretty good! We also want to take all the age-group variables and make one variable holding the age-group name and one holding the population count for that age group. To do this we will use the pivot_longer() function of the tidyr package. To use this function, we need the cols argument to indicate which columns we want to pivot. We also name the new variables we will create with the names_to and values_to arguments: names_to gives the name of the variable that will identify each age group, and values_to gives the name of the variable that contains the corresponding population values.
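A sketch of the pivot on a toy wide tibble (only two age-group columns shown; on the full data the cols range would span all eighteen, from `Under 5 years` through `85 years and over`):

```r
library(dplyr)
library(tidyr)

# Toy wide data with two of the age-group columns (values made up)
wide <- tibble(YEAR = 1977, STATE = "Alabama", SEX = "male", RACE = "White",
               `Under 5 years` = 98814, `5 to 9 years` = 113365)

long <- wide %>%
  pivot_longer(cols = `Under 5 years`:`5 to 9 years`,
               names_to = "AGE_GROUP",   # new column of age-group names
               values_to = "SUB_POP")    # new column of population counts
```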
Rows: 16,524
Columns: 6
$ YEAR <dbl> 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977, 1977,…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala…
$ SEX <chr> "male", "male", "male", "male", "male", "male", "male", "ma…
$ RACE <chr> "White", "White", "White", "White", "White", "White", "Whit…
$ AGE_GROUP <chr> "Under 5 years", "5 to 9 years", "10 to 14 years", "15 to 1…
$ SUB_POP <dbl> 98814, 113365, 123107, 135343, 126053, 111547, 100674, 8103…
We also want to get data about the total population for the state for each year.
To do so we can sum all the values of the SUB_POP variable that we just created, using the group_by() and summarize() functions of the dplyr package. The group_by() function specifies how we want to calculate our sum: for each year and each state individually. Thus, all the values that share the same STATE and YEAR will be summed together, rather than summing over the entire SUB_POP variable. The .groups argument allows us to remove the grouping after we perform the calculation with summarize().
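As a sketch, on a toy four-row version of the long data (values made up):

```r
library(dplyr)

demo <- tibble(YEAR    = c(1977, 1977, 1977, 1977),
               STATE   = c("Alabama", "Alabama", "Alaska", "Alaska"),
               SUB_POP = c(100, 200, 10, 20))

pop <- demo %>%
  group_by(YEAR, STATE) %>%                          # sum within each year-state pair
  summarize(TOT_POP = sum(SUB_POP), .groups = "drop")
# one row per YEAR-STATE combination: Alabama 300, Alaska 30
```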
# A tibble: 153 x 3
YEAR STATE TOT_POP
<dbl> <chr> <dbl>
1 1977 Alabama 3782571
2 1977 Alaska 397220
3 1977 Arizona 2427296
4 1977 Arkansas 2207195
5 1977 California 22350332
6 1977 Colorado 2696179
7 1977 Connecticut 3088745
8 1977 Delaware 594815
9 1977 District of Columbia 681766
10 1977 Florida 8888806
# … with 143 more rows
Now we will add the total population values to the demographic tibble using the left_join() function of the dplyr package. It is important that we specify how the join should be performed: rows should match on both their YEAR and STATE values. This will place the dem_77_79 variables to the left of the pop_77_79 data.
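A sketch of the join on toy versions of the two tibbles (values made up):

```r
library(dplyr)

dem_toy <- tibble(YEAR = 1977, STATE = c("Alabama", "Alaska"), SUB_POP = c(100, 10))
pop_toy <- tibble(YEAR = 1977, STATE = c("Alabama", "Alaska"), TOT_POP = c(300, 30))

# match rows on both YEAR and STATE
joined <- left_join(dem_toy, pop_toy, by = c("YEAR", "STATE"))
```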
# A tibble: 16,524 x 7
YEAR STATE SEX RACE AGE_GROUP SUB_POP TOT_POP
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 1977 Alabama male White Under 5 years 98814 3782571
2 1977 Alabama male White 5 to 9 years 113365 3782571
3 1977 Alabama male White 10 to 14 years 123107 3782571
4 1977 Alabama male White 15 to 19 years 135343 3782571
5 1977 Alabama male White 20 to 24 years 126053 3782571
6 1977 Alabama male White 25 to 29 years 111547 3782571
7 1977 Alabama male White 30 to 34 years 100674 3782571
8 1977 Alabama male White 35 to 39 years 81038 3782571
9 1977 Alabama male White 40 to 44 years 75042 3782571
10 1977 Alabama male White 45 to 49 years 76296 3782571
# … with 16,514 more rows
We will also calculate the percentage that each group makes up of the total population, by dividing SUB_POP by TOT_POP and multiplying by 100 using the mutate() function. We will also remove the other population variables.
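A sketch of this calculation (the first two SUB_POP values and the 1977 Alabama total are taken from the output above):

```r
library(dplyr)

joined <- tibble(SUB_POP = c(98814, 113365), TOT_POP = 3782571)

perc <- joined %>%
  mutate(PERC_SUB_POP = (SUB_POP / TOT_POP) * 100) %>%  # percent of state total
  select(-SUB_POP, -TOT_POP)                            # drop the raw counts
# the first value rounds to 2.61 percent
```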
# A tibble: 16,524 x 6
YEAR STATE SEX RACE AGE_GROUP PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1977 Alabama male White Under 5 years 2.61
2 1977 Alabama male White 5 to 9 years 3.00
3 1977 Alabama male White 10 to 14 years 3.25
4 1977 Alabama male White 15 to 19 years 3.58
5 1977 Alabama male White 20 to 24 years 3.33
6 1977 Alabama male White 25 to 29 years 2.95
7 1977 Alabama male White 30 to 34 years 2.66
8 1977 Alabama male White 35 to 39 years 2.14
9 1977 Alabama male White 40 to 44 years 1.98
10 1977 Alabama male White 45 to 49 years 2.02
# … with 16,514 more rows
It is important to make sure that we have the total values we would expect. We have two levels of SEX, three levels of RACE, three levels of YEAR, eighteen levels of AGE_GROUP, and fifty-one levels of STATE. If we multiply these together we get 16,524, which is the same as the number of rows in our final dem_77_79 data. Looks good!
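This check is simple arithmetic:

```r
# SEX x RACE x YEAR x AGE_GROUP x STATE
n_expected <- 2 * 3 * 3 * 18 * 51
n_expected
#> [1] 16524
```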
Let’s also capitalize the values of the SEX variable so that they match the style of the values of the other variables, like RACE. This will help us keep values consistent as we wrangle the data for the other decades. To do so we will use the str_to_title() function of the stringr package. We need to use the pull() function to get the values of SEX out of dem_77_79. Once they are capitalized, they are reassigned to the SEX variable.
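A small illustration of str_to_title() (on the real data the result would be reassigned to the SEX variable of dem_77_79):

```r
library(stringr)

# capitalize the first letter of each word
str_to_title(c("male", "female"))
#> [1] "Male"   "Female"
```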
1980-1989
For this decade each year is a separate tibble and they are combined as a list.
[1] "list"
So the first thing we need to do is combine the tibbles of the list together. We can do that using the bind_rows() function of dplyr, which appends data together based on columns with the same name in the different tibbles. We will use the map_df() function of the purrr package to apply this across each tibble in our list.
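A sketch with a toy list of two one-row tibbles (values made up):

```r
library(dplyr)
library(purrr)

# Toy list of yearly tibbles, like the 1980-1989 data
yearly <- list(tibble(YEAR = 1980, POP = 1),
               tibble(YEAR = 1981, POP = 2))

# apply bind_rows across the list and row-bind all of the results together
combined <- map_df(yearly, bind_rows)
```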
Rows: 188,460
Columns: 21
$ `Year of Estimate` <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 198…
$ `FIPS State and County Codes` <chr> "01001", "01001", "01001", "01001", "01…
$ `Race/Sex Indicator` <chr> "White male", "White female", "Black ma…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 6…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 2…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 1…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 1…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 1…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 16…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 1…
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65,…
Great! Now our data is all together.
Now we will wrangle the data similarly to the previous decade.
Rows: 188,460
Columns: 22
$ YEAR <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 198…
$ `FIPS State and County Codes` <chr> "01001", "01001", "01001", "01001", "01…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 6…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 2…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 1…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 1…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 1…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 16…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 1…
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65,…
$ SEX <chr> "male", "female", "male", "female", "ma…
$ RACE <chr> "White", "White", "Black", "Black", "Ot…
Notice that this time the state information is based on the numeric FIPS value. We want only the first two characters, as the rest indicate the county. We can use the str_sub() function of the stringr package for this. We will specify that we want to start at the first position and end at the second. Just as with str_extract(), we first rename this variable so that its name has no spaces.
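For example, on a few made-up county-level FIPS codes:

```r
library(stringr)

fips <- c("01001", "01003", "02013")   # first two digits = state, rest = county

str_sub(fips, start = 1, end = 2)
#> [1] "01" "01" "02"
```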
Rows: 188,460
Columns: 23
$ YEAR <dbl> 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1980, 1…
$ STATEFP_temp <chr> "01001", "01001", "01001", "01001", "01001", "010…
$ `Under 5 years` <dbl> 985, 831, 357, 346, 4, 7, 2422, 2346, 672, 645, 3…
$ `5 to 9 years` <dbl> 1096, 987, 427, 395, 9, 8, 2661, 2467, 740, 680, …
$ `10 to 14 years` <dbl> 1271, 1074, 395, 415, 4, 11, 2783, 2614, 644, 670…
$ `15 to 19 years` <dbl> 1308, 1259, 460, 429, 10, 5, 3049, 2841, 711, 762…
$ `20 to 24 years` <dbl> 972, 1006, 300, 380, 3, 3, 2423, 2428, 516, 601, …
$ `25 to 29 years` <dbl> 850, 912, 240, 235, 2, 11, 2372, 2475, 414, 469, …
$ `30 to 34 years` <dbl> 891, 983, 163, 196, 4, 10, 2410, 2400, 303, 352, …
$ `35 to 39 years` <dbl> 942, 1015, 120, 158, 3, 12, 2101, 2202, 224, 260,…
$ `40 to 44 years` <dbl> 854, 882, 133, 147, 2, 11, 1881, 1859, 206, 288, …
$ `45 to 49 years` <dbl> 828, 739, 107, 154, 4, 11, 1708, 1694, 219, 236, …
$ `50 to 54 years` <dbl> 631, 602, 113, 165, 1, 7, 1657, 1798, 203, 261, 7…
$ `55 to 59 years` <dbl> 524, 532, 113, 150, 1, 2, 1641, 1943, 178, 219, 8…
$ `60 to 64 years` <dbl> 428, 451, 126, 166, 0, 1, 1630, 1819, 171, 209, 8…
$ `65 to 69 years` <dbl> 358, 417, 128, 160, 1, 0, 1503, 1729, 170, 232, 6…
$ `70 to 74 years` <dbl> 242, 332, 87, 119, 0, 0, 1163, 1335, 164, 182, 4,…
$ `75 to 79 years` <dbl> 123, 237, 70, 94, 0, 0, 671, 906, 87, 129, 3, 6, …
$ `80 to 84 years` <dbl> 52, 137, 31, 57, 0, 0, 331, 527, 43, 67, 1, 2, 56…
$ `85 years and over` <dbl> 39, 86, 13, 44, 0, 1, 187, 408, 27, 65, 1, 1, 30,…
$ SEX <chr> "male", "female", "male", "female", "male", "fema…
$ RACE <chr> "White", "White", "Black", "Black", "Other", "Oth…
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alab…
# A tibble: 55,080 x 6
YEAR STATE AGE_GROUP SEX RACE SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1980 Alabama 10 to 14 years female Black 50108
2 1980 Alabama 10 to 14 years female Other 805
3 1980 Alabama 10 to 14 years female White 109066
4 1980 Alabama 10 to 14 years male Black 50768
5 1980 Alabama 10 to 14 years male Other 826
6 1980 Alabama 10 to 14 years male White 115988
7 1980 Alabama 15 to 19 years female Black 58428
8 1980 Alabama 15 to 19 years female Other 743
9 1980 Alabama 15 to 19 years female White 126783
10 1980 Alabama 15 to 19 years male Black 56808
# … with 55,070 more rows
# A tibble: 55,080 x 6
YEAR STATE AGE_GROUP SEX RACE PERC_SUB_POP
<dbl> <chr> <chr> <chr> <chr> <dbl>
1 1980 Alabama 10 to 14 years female Black 1.28
2 1980 Alabama 10 to 14 years female Other 0.0206
3 1980 Alabama 10 to 14 years female White 2.80
4 1980 Alabama 10 to 14 years male Black 1.30
5 1980 Alabama 10 to 14 years male Other 0.0212
6 1980 Alabama 10 to 14 years male White 2.97
7 1980 Alabama 15 to 19 years female Black 1.50
8 1980 Alabama 15 to 19 years female Other 0.0191
9 1980 Alabama 15 to 19 years female White 3.25
10 1980 Alabama 15 to 19 years male Black 1.46
# … with 55,070 more rows
Just like with the data from the 70s we will also change the values for SEX to be capitalized.
Again, it is important to make sure that we have the total values we would expect. This time we have two levels of SEX, three levels of RACE, ten levels of YEAR, eighteen levels of AGE_GROUP, and fifty-one levels of STATE.
If we multiply these together we get 55,080, which is the same as the number of rows of the final dem_80_89 data. Looks good!
1990-1999
Just like the 80s we need to combine the data across the files:
Rows: 43,870
Columns: 19
$ Year <dbl> NA, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 19…
$ e <chr> NA, "01", "01", "01", "01", "01", "01", "01", "01", "01", "0…
$ Age <dbl> NA, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
$ Male <dbl> NA, 20406, 19393, 18990, 19246, 19502, 19560, 19091, 19605, …
$ Female <dbl> NA, 19101, 18114, 18043, 17786, 18366, 18386, 18047, 18316, …
$ Male_1 <dbl> NA, 9794, 9475, 9097, 9002, 9076, 9169, 8919, 9219, 9247, 10…
$ Female_1 <dbl> NA, 9414, 9247, 8837, 8701, 8989, 9093, 8736, 9192, 9108, 97…
$ Male_2 <dbl> NA, 103, 87, 97, 94, 108, 128, 160, 178, 166, 205, 194, 179,…
$ Female_2 <dbl> NA, 90, 93, 100, 115, 114, 130, 134, 162, 155, 193, 185, 202…
$ Male_3 <dbl> NA, 192, 146, 175, 150, 168, 170, 183, 171, 136, 177, 169, 1…
$ Female_3 <dbl> NA, 170, 182, 160, 157, 178, 158, 173, 177, 185, 179, 171, 1…
$ Male_4 <dbl> NA, 223, 190, 198, 186, 190, 210, 188, 178, 182, 221, 194, 1…
$ Female_4 <dbl> NA, 220, 196, 173, 191, 190, 170, 172, 179, 173, 166, 175, 1…
$ Male_5 <dbl> NA, 47, 41, 32, 35, 36, 30, 28, 27, 29, 32, 31, 33, 34, 32, …
$ Female_5 <dbl> NA, 45, 47, 41, 30, 26, 37, 23, 35, 31, 28, 38, 22, 39, 29, …
$ Male_6 <dbl> NA, 1, 2, 1, 9, 5, 8, 2, 4, 6, 6, 0, 1, 9, 6, 7, 5, 2, 2, 4,…
$ Female_6 <dbl> NA, 8, 0, 2, 1, 4, 5, 3, 4, 4, 3, 4, 2, 2, 7, 0, 2, 2, 1, 6,…
$ Male_7 <dbl> NA, 5, 7, 2, 3, 5, 11, 2, 7, 12, 10, 7, 5, 6, 5, 6, 6, 2, 11…
$ Female_7 <dbl> NA, 5, 5, 5, 3, 14, 6, 7, 6, 3, 11, 5, 5, 7, 8, 6, 6, 7, 3, …
For this decade the column names can’t all be imported in a simple way from the table, so they need to be recoded.
Here is what the data looks like before importing:

So, first, using the base colnames() function, we change the column names.
colnames(dem_90_99) <- c("YEAR",
                         "STATEFP",
                         "Age",
                         "NH_W_M",
                         "NH_W_F",
                         "NH_B_M",
                         "NH_B_F",
                         "NH_AIAN_M",
                         "NH_AIAN_F",
                         "NH_API_M",
                         "NH_API_F",
                         "H_W_M",
                         "H_W_F",
                         "H_B_M",
                         "H_B_F",
                         "H_AIAN_M",
                         "H_AIAN_F",
                         "H_API_M",
                         "H_API_F")
glimpse(dem_90_99)
Rows: 43,870
Columns: 19
$ YEAR <dbl> NA, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1…
$ STATEFP <chr> NA, "01", "01", "01", "01", "01", "01", "01", "01", "01", "…
$ Age <dbl> NA, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
$ NH_W_M <dbl> NA, 20406, 19393, 18990, 19246, 19502, 19560, 19091, 19605,…
$ NH_W_F <dbl> NA, 19101, 18114, 18043, 17786, 18366, 18386, 18047, 18316,…
$ NH_B_M <dbl> NA, 9794, 9475, 9097, 9002, 9076, 9169, 8919, 9219, 9247, 1…
$ NH_B_F <dbl> NA, 9414, 9247, 8837, 8701, 8989, 9093, 8736, 9192, 9108, 9…
$ NH_AIAN_M <dbl> NA, 103, 87, 97, 94, 108, 128, 160, 178, 166, 205, 194, 179…
$ NH_AIAN_F <dbl> NA, 90, 93, 100, 115, 114, 130, 134, 162, 155, 193, 185, 20…
$ NH_API_M <dbl> NA, 192, 146, 175, 150, 168, 170, 183, 171, 136, 177, 169, …
$ NH_API_F <dbl> NA, 170, 182, 160, 157, 178, 158, 173, 177, 185, 179, 171, …
$ H_W_M <dbl> NA, 223, 190, 198, 186, 190, 210, 188, 178, 182, 221, 194, …
$ H_W_F <dbl> NA, 220, 196, 173, 191, 190, 170, 172, 179, 173, 166, 175, …
$ H_B_M <dbl> NA, 47, 41, 32, 35, 36, 30, 28, 27, 29, 32, 31, 33, 34, 32,…
$ H_B_F <dbl> NA, 45, 47, 41, 30, 26, 37, 23, 35, 31, 28, 38, 22, 39, 29,…
$ H_AIAN_M <dbl> NA, 1, 2, 1, 9, 5, 8, 2, 4, 6, 6, 0, 1, 9, 6, 7, 5, 2, 2, 4…
$ H_AIAN_F <dbl> NA, 8, 0, 2, 1, 4, 5, 3, 4, 4, 3, 4, 2, 2, 7, 0, 2, 2, 1, 6…
$ H_API_M <dbl> NA, 5, 7, 2, 3, 5, 11, 2, 7, 12, 10, 7, 5, 6, 5, 6, 6, 2, 1…
$ H_API_F <dbl> NA, 5, 5, 5, 3, 14, 6, 7, 6, 3, 11, 5, 5, 7, 8, 6, 6, 7, 3,…
Notice also that the first row is all NA values, which comes from white space in the original table for 1990; this is probably true for each year. We can check the dimensions of our table using the base dim() function. When we filter for rows where YEAR is NA, we indeed see 10 rows, which is what we would expect if there is a row like this for each of the years in the decade. We see the same if we try a different variable. Now we will test how large our tibble is if we drop rows with NA values using the drop_na() function of tidyr. We see that the dimensions changed by only ten, so there are no other rows with unexpected missing values. So now we will reassign the dem_90_99 variable after removing these rows.
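The checks just described might look like the following sketch, shown on a toy tibble with one all-NA spacer row (values made up):

```r
library(dplyr)
library(tidyr)

# Toy tibble mimicking the white-space spacer rows in the 90s data
dat <- tibble(YEAR    = c(NA, 1990, 1990),
              STATEFP = c(NA, "01", "01"))

dim(dat)                   # dimensions before dropping
filter(dat, is.na(YEAR))   # inspect the all-NA rows
dat <- drop_na(dat)        # remove rows containing missing values
dim(dat)                   # only the spacer row is gone
```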
[1] 43870 19
# A tibble: 10 x 19
YEAR STATEFP Age NH_W_M NH_W_F NH_B_M NH_B_F NH_AIAN_M NH_AIAN_F NH_API_M
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA <NA> NA NA NA NA NA NA NA NA
2 NA <NA> NA NA NA NA NA NA NA NA
3 NA <NA> NA NA NA NA NA NA NA NA
4 NA <NA> NA NA NA NA NA NA NA NA
5 NA <NA> NA NA NA NA NA NA NA NA
6 NA <NA> NA NA NA NA NA NA NA NA
7 NA <NA> NA NA NA NA NA NA NA NA
8 NA <NA> NA NA NA NA NA NA NA NA
9 NA <NA> NA NA NA NA NA NA NA NA
10 NA <NA> NA NA NA NA NA NA NA NA
# … with 9 more variables: NH_API_F <dbl>, H_W_M <dbl>, H_W_F <dbl>,
# H_B_M <dbl>, H_B_F <dbl>, H_AIAN_M <dbl>, H_AIAN_F <dbl>, H_API_M <dbl>,
# H_API_F <dbl>
# A tibble: 10 x 19
YEAR STATEFP Age NH_W_M NH_W_F NH_B_M NH_B_F NH_AIAN_M NH_AIAN_F NH_API_M
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 NA <NA> NA NA NA NA NA NA NA NA
2 NA <NA> NA NA NA NA NA NA NA NA
3 NA <NA> NA NA NA NA NA NA NA NA
4 NA <NA> NA NA NA NA NA NA NA NA
5 NA <NA> NA NA NA NA NA NA NA NA
6 NA <NA> NA NA NA NA NA NA NA NA
7 NA <NA> NA NA NA NA NA NA NA NA
8 NA <NA> NA NA NA NA NA NA NA NA
9 NA <NA> NA NA NA NA NA NA NA NA
10 NA <NA> NA NA NA NA NA NA NA NA
# … with 9 more variables: NH_API_F <dbl>, H_W_M <dbl>, H_W_F <dbl>,
# H_B_M <dbl>, H_B_F <dbl>, H_AIAN_M <dbl>, H_AIAN_F <dbl>, H_API_M <dbl>,
# H_API_F <dbl>
# A tibble: 43,860 x 19
YEAR STATEFP Age NH_W_M NH_W_F NH_B_M NH_B_F NH_AIAN_M NH_AIAN_F NH_API_M
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1990 01 0 20406 19101 9794 9414 103 90 192
2 1990 01 1 19393 18114 9475 9247 87 93 146
3 1990 01 2 18990 18043 9097 8837 97 100 175
4 1990 01 3 19246 17786 9002 8701 94 115 150
5 1990 01 4 19502 18366 9076 8989 108 114 168
6 1990 01 5 19560 18386 9169 9093 128 130 170
7 1990 01 6 19091 18047 8919 8736 160 134 183
8 1990 01 7 19605 18316 9219 9192 178 162 171
9 1990 01 8 18823 17743 9247 9108 166 155 136
10 1990 01 9 20226 19178 10194 9784 205 193 177
# … with 43,850 more rows, and 9 more variables: NH_API_F <dbl>, H_W_M <dbl>,
# H_W_F <dbl>, H_B_M <dbl>, H_B_F <dbl>, H_AIAN_M <dbl>, H_AIAN_F <dbl>,
# H_API_M <dbl>, H_API_F <dbl>
Then we sum across the non-Hispanic and Hispanic groups, because this breakdown is not available for the previous decades. Then we will remove the variables for the Hispanic and non-Hispanic subgroups using select().
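A sketch for one race/sex pair (the first two NH_W_M and H_W_M values are taken from the output above; on the real data the mutate() line would be repeated for each race/sex pair):

```r
library(dplyr)

dat <- tibble(NH_W_M = c(20406, 19393),
              H_W_M  = c(223, 190))

dat <- dat %>%
  mutate(W_M = NH_W_M + H_W_M) %>%                  # combined count
  select(-starts_with("NH_"), -starts_with("H_"))   # drop the subgroup columns
# W_M is 20629 and 19583, matching the combined totals in the output below
```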
Rows: 43,860
Columns: 11
$ YEAR <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1…
$ STATEFP <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01", "…
$ Age <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ W_M <dbl> 20629, 19583, 19188, 19432, 19692, 19770, 19279, 19783, 19005…
$ W_F <dbl> 19321, 18310, 18216, 17977, 18556, 18556, 18219, 18495, 17916…
$ B_M <dbl> 9841, 9516, 9129, 9037, 9112, 9199, 8947, 9246, 9276, 10226, …
$ B_F <dbl> 9459, 9294, 8878, 8731, 9015, 9130, 8759, 9227, 9139, 9812, 1…
$ AIAN_M <dbl> 104, 89, 98, 103, 113, 136, 162, 182, 172, 211, 194, 180, 209…
$ AIAN_F <dbl> 98, 93, 102, 116, 118, 135, 137, 166, 159, 196, 189, 204, 198…
$ API_M <dbl> 197, 153, 177, 153, 173, 181, 185, 178, 148, 187, 176, 164, 1…
$ API_F <dbl> 175, 187, 165, 160, 192, 164, 180, 183, 188, 190, 176, 168, 1…
Looking better! We also need to add age groups like the other decades. We will take a look at the 80s data using the distinct() function of the dplyr package to see what age groups we need. We can use the base cut() function with mutate() to create a new variable called AGE_GROUP that has a label for every 5-year span of age. The right = FALSE argument specifies that each interval is not closed on the right, so a value exactly at a cut point (for example an Age value of 5) falls into the 5 to 9 years group.
We can make the labels for the AGE_GROUP variable match those of dem_77_79 but we need to pull out the values of the tibble created by distinct(). To do this we can use the pull() function from the dplyr package. Note that it is important to check that the AGE_GROUP values are listed in order for dem_77_79. We will also remove the Age variable after we create the new AGE_GROUP variable for the dem_90_99 data.
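A sketch of the cut() call using just three toy groups (the real call uses breaks every five years up to Inf and the eighteen labels pulled from dem_77_79):

```r
ages <- c(0, 4, 5, 9, 10, 14)

age_group <- cut(ages,
                 breaks = c(0, 5, 10, 15),
                 labels = c("Under 5 years", "5 to 9 years", "10 to 14 years"),
                 right = FALSE)   # intervals closed on the left: [0,5), [5,10), ...
# an age of exactly 5 falls in "5 to 9 years" because right = FALSE
```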
# A tibble: 18 x 1
AGE_GROUP
<chr>
1 Under 5 years
2 5 to 9 years
3 10 to 14 years
4 15 to 19 years
5 20 to 24 years
6 25 to 29 years
7 30 to 34 years
8 35 to 39 years
9 40 to 44 years
10 45 to 49 years
11 50 to 54 years
12 55 to 59 years
13 60 to 64 years
14 65 to 69 years
15 70 to 74 years
16 75 to 79 years
17 80 to 84 years
18 85 years and over
[1] "Under 5 years" "5 to 9 years" "10 to 14 years"
[4] "15 to 19 years" "20 to 24 years" "25 to 29 years"
[7] "30 to 34 years" "35 to 39 years" "40 to 44 years"
[10] "45 to 49 years" "50 to 54 years" "55 to 59 years"
[13] "60 to 64 years" "65 to 69 years" "70 to 74 years"
[16] "75 to 79 years" "80 to 84 years" "85 years and over"
Rows: 43,860
Columns: 11
$ YEAR <dbl> 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990,…
$ STATEFP <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "01",…
$ W_M <dbl> 20629, 19583, 19188, 19432, 19692, 19770, 19279, 19783, 190…
$ W_F <dbl> 19321, 18310, 18216, 17977, 18556, 18556, 18219, 18495, 179…
$ B_M <dbl> 9841, 9516, 9129, 9037, 9112, 9199, 8947, 9246, 9276, 10226…
$ B_F <dbl> 9459, 9294, 8878, 8731, 9015, 9130, 8759, 9227, 9139, 9812,…
$ AIAN_M <dbl> 104, 89, 98, 103, 113, 136, 162, 182, 172, 211, 194, 180, 2…
$ AIAN_F <dbl> 98, 93, 102, 116, 118, 135, 137, 166, 159, 196, 189, 204, 1…
$ API_M <dbl> 197, 153, 177, 153, 173, 181, 185, 178, 148, 187, 176, 164,…
$ API_F <dbl> 175, 187, 165, 160, 192, 164, 180, 183, 188, 190, 176, 168,…
$ AGE_GROUP <fct> Under 5 years, Under 5 years, Under 5 years, Under 5 years,…
As with the previous decades, we will use pivot_longer() to turn all of the subgroup variables into a single RACE variable and a corresponding population count variable.
Now we need to collapse the data for the various races so that it matches the previous decades. This time we will use the case_when() function of the dplyr package and the str_detect() function of the stringr package to identify when the race is something other than B or W and replace it with the value Other. The value to the right of the ~ indicates what we want the value of the new variable to be when the value tested with str_detect() matches the specified condition. If the value does not match any of the specified conditions, it will be assigned whatever is listed after TRUE ~. We will then create population counts as we did previously for the other decades.
Finally, we will create new sums for the sub-populations, summing across the two Other race subgroups to create a single value for each combination of YEAR, SEX, AGE_GROUP, and STATE, using the group_by() and summarize() functions.
dem_90_99 %<>%
  pivot_longer(cols = c(starts_with("W_"),
                        starts_with("B_"),
                        starts_with("AIAN_"),
                        starts_with("API_")),
               names_to = "RACE",
               values_to = "SUB_POP_temp")

dem_90_99 %<>%
  mutate(SEX = case_when(str_detect(RACE, "_M") ~ "Male",
                         TRUE ~ "Female"),
         RACE = case_when(str_detect(RACE, "W_") ~ "White",
                          str_detect(RACE, "B_") ~ "Black",
                          TRUE ~ "Other")) %>%
  left_join(STATE_FIPS, by = "STATEFP") %>%
  dplyr::select(-STATEFP)

dem_90_99 %<>%
  group_by(YEAR, STATE, AGE_GROUP, SEX, RACE) %>%
  summarize(SUB_POP = sum(SUB_POP_temp), .groups = "drop")
# A tibble: 55,080 x 6
YEAR STATE AGE_GROUP SEX RACE PERC_SUB_POP
<dbl> <chr> <fct> <chr> <chr> <dbl>
1 1990 Alabama Under 5 years Female Black 1.12
2 1990 Alabama Under 5 years Female Other 0.0347
3 1990 Alabama Under 5 years Female White 2.28
4 1990 Alabama Under 5 years Male Black 1.15
5 1990 Alabama Under 5 years Male Other 0.0336
6 1990 Alabama Under 5 years Male White 2.43
7 1990 Alabama 5 to 9 years Female Black 1.14
8 1990 Alabama 5 to 9 years Female Other 0.0419
9 1990 Alabama 5 to 9 years Female White 2.29
10 1990 Alabama 5 to 9 years Male Black 1.16
# … with 55,070 more rows
Again, we should check that we have the total values we would expect. We have the same number of unique values for each of our variables as with the data from the 80s, so if the row count matches, we have collapsed the additional sub-populations in this data correctly.
Indeed it looks like we have 55,080 rows, which is what we would expect and is the same as the number of rows of the final dem_80_89 data. Looks good!
2000-2010
Again, for this decade we need to combine the data across years.
Rows: 62,244
Columns: 21
$ REGION <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ DIVISION <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ STATE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ NAME <chr> "United States", "United States", "United States", …
$ SEX <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ ORIGIN <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ RACE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ AGEGRP <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
$ ESTIMATESBASE2000 <dbl> 281424600, 19176154, 20549855, 20528425, 20218782, …
$ POPESTIMATE2000 <dbl> 282162411, 19178293, 20463852, 20637696, 20294955, …
$ POPESTIMATE2001 <dbl> 284968955, 19298217, 20173362, 20978678, 20456284, …
$ POPESTIMATE2002 <dbl> 287625193, 19429192, 19872417, 21261421, 20610370, …
$ POPESTIMATE2003 <dbl> 290107933, 19592446, 19620851, 21415353, 20797166, …
$ POPESTIMATE2004 <dbl> 292805298, 19785885, 19454237, 21411680, 21102552, …
$ POPESTIMATE2005 <dbl> 295516599, 19917400, 19389067, 21212579, 21486214, …
$ POPESTIMATE2006 <dbl> 298379912, 19938883, 19544688, 21033138, 21807709, …
$ POPESTIMATE2007 <dbl> 301231207, 20125962, 19714611, 20841042, 22067816, …
$ POPESTIMATE2008 <dbl> 304093966, 20271127, 19929602, 20706655, 22210880, …
$ POPESTIMATE2009 <dbl> 306771529, 20244518, 20182499, 20660564, 22192810, …
$ CENSUS2010POP <dbl> 308745538, 20201362, 20348657, 20677194, 22040343, …
$ POPESTIMATE2010 <dbl> 309349689, 20200529, 20382409, 20694011, 21959087, …
OK, the data looks a bit different from the others. First we will remove a couple of variables that we probably don’t need. It also looks like we have some values for the entire United States, and we will drop these to match the other decades.
We can see that there are lots of values that are zero. According to the technical documentation for this data, a zero value indicates a total across the categories of SEX, ORIGIN, RACE, or AGEGRP.

So we will drop the total values for SEX, RACE, and AGEGRP by removing the rows where these variables are equal to zero.
We will also keep only the rows that give totals across Hispanic origin, where ORIGIN is equal to zero, because we do not wish to divide the data into subgroups about Hispanic ethnicity when we do not have that information for the first two decades.
We will then remove the REGION, DIVISION, STATE, and ORIGIN variables, rename NAME to STATE, and rename AGEGRP to AGE_GROUP to match the other decades.
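The filtering, selecting, and renaming steps just described might be sketched like this (a sketch on a small made-up tibble; the real data has many more rows and all of the POPESTIMATE columns):

```r
library(dplyr)

# Toy data mimicking the coded Census columns (0 = total within SEX, ORIGIN, RACE, AGEGRP)
toy <- tibble(
  REGION = 3, DIVISION = 6, STATE = "01", NAME = "Alabama",
  SEX    = c(0, 1, 1, 2),
  ORIGIN = c(0, 0, 1, 0),
  RACE   = c(0, 1, 1, 1),
  AGEGRP = c(0, 1, 1, 1),
  POPESTIMATE2000 = c(4400000, 99000, 1200, 95000))

wrangled <- toy %>%
  # keep only the ORIGIN totals (0 = all origins combined)
  filter(ORIGIN == 0) %>%
  # drop the total rows for the other coded variables
  filter(SEX != 0, RACE != 0, AGEGRP != 0) %>%
  # remove variables we no longer need
  select(-REGION, -DIVISION, -STATE, -ORIGIN) %>%
  # rename to match the other decades
  rename(STATE = NAME, AGE_GROUP = AGEGRP)
```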
# A tibble: 11,016 x 15
STATE SEX RACE AGE_GROUP POPESTIMATE2000 POPESTIMATE2001 POPESTIMATE2002
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Alab… 1 1 1 99527 99985 99578
2 Alab… 1 1 2 104423 102518 101023
3 Alab… 1 1 3 108325 108412 108059
4 Alab… 1 1 4 108638 107370 107337
5 Alab… 1 1 5 104337 107230 108195
6 Alab… 1 1 6 106491 101466 98949
7 Alab… 1 1 7 110116 110630 110416
8 Alab… 1 1 8 123719 120283 116502
9 Alab… 1 1 9 124961 125443 124751
10 Alab… 1 1 10 115024 117010 119354
# … with 11,006 more rows, and 8 more variables: POPESTIMATE2003 <dbl>,
# POPESTIMATE2004 <dbl>, POPESTIMATE2005 <dbl>, POPESTIMATE2006 <dbl>,
# POPESTIMATE2007 <dbl>, POPESTIMATE2008 <dbl>, POPESTIMATE2009 <dbl>,
# POPESTIMATE2010 <dbl>
Now we need to recode the numeric values to the values in the technical documentation. We can do so by adding labels to each numeric level using the base function factor().
dem_00_10 %<>%
  mutate(SEX = factor(SEX,
                      levels = 1:2,
                      labels = c("Male",
                                 "Female")),
         RACE = factor(RACE,
                       levels = 1:6,
                       labels = c("White",
                                  "Black",
                                  rep("Other", 4))),
         AGE_GROUP = factor(AGE_GROUP,
                            levels = 1:18,
                            labels = pull(distinct(dem_77_79, AGE_GROUP), AGE_GROUP)))
glimpse(dem_00_10)
Rows: 11,016
Columns: 15
$ STATE <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama"…
$ SEX <fct> Male, Male, Male, Male, Male, Male, Male, Male, Male,…
$ RACE <fct> White, White, White, White, White, White, White, Whit…
$ AGE_GROUP <fct> Under 5 years, 5 to 9 years, 10 to 14 years, 15 to 19…
$ POPESTIMATE2000 <dbl> 99527, 104423, 108325, 108638, 104337, 106491, 110116…
$ POPESTIMATE2001 <dbl> 99985, 102518, 108412, 107370, 107230, 101466, 110630…
$ POPESTIMATE2002 <dbl> 99578, 101023, 108059, 107337, 108195, 98949, 110416,…
$ POPESTIMATE2003 <dbl> 99627, 99920, 108026, 107749, 109360, 98276, 109893, …
$ POPESTIMATE2004 <dbl> 99788, 99306, 107627, 108666, 109037, 98742, 107653, …
$ POPESTIMATE2005 <dbl> 100316, 99754, 106570, 110278, 108727, 100327, 105151…
$ POPESTIMATE2006 <dbl> 100820, 101251, 106228, 111640, 108847, 103869, 10161…
$ POPESTIMATE2007 <dbl> 101766, 101985, 106243, 112353, 109496, 105175, 99917…
$ POPESTIMATE2008 <dbl> 102304, 102479, 106155, 113305, 110007, 106348, 99921…
$ POPESTIMATE2009 <dbl> 101411, 102688, 106130, 113741, 111167, 106497, 10138…
$ POPESTIMATE2010 <dbl> 99480, 102939, 106324, 112272, 112423, 106593, 102923…
OK, we also want to change the shape of the data so that we have a YEAR variable, with each population estimate becoming a value in a new variable called SUB_POP_temp.
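This reshaping can be sketched with the pivot_longer() function of the tidyr package (a sketch on a small made-up tibble; the real data has all eleven POPESTIMATE columns):

```r
library(dplyr)
library(tidyr)

wide <- tibble(STATE = "Alabama", SEX = "Male",
               POPESTIMATE2000 = 99000, POPESTIMATE2001 = 99500)

# Gather all POPESTIMATE* columns into a YEAR / SUB_POP_temp pair
long <- wide %>%
  pivot_longer(cols = starts_with("POPESTIMATE"),
               names_to = "YEAR",
               values_to = "SUB_POP_temp")
```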
We will now clean up the YEAR variable to contain only the numeric value by keeping the last four characters of each string using the str_sub() function of the stringr package.
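A sketch of this cleanup step (since the later output shows YEAR as numeric, we also convert the extracted characters with as.numeric()):

```r
library(dplyr)
library(stringr)

long <- tibble(YEAR = c("POPESTIMATE2000", "POPESTIMATE2001"),
               SUB_POP_temp = c(99000, 99500))

# Keep only the last four characters ("2000", "2001") and make them numeric
long <- long %>%
  mutate(YEAR = as.numeric(str_sub(YEAR, start = -4)))
```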
Now we will collapse the data across the different RACE values and calculate a new SUB_POP value.
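Since the four "Other" codes were all given the same RACE label above, collapsing amounts to summing the population within each combination of the remaining variables. A sketch (on a small made-up tibble):

```r
library(dplyr)

# Four rows that now share the same "Other" RACE label
long <- tibble(STATE = "Alabama", SEX = "Male", AGE_GROUP = "Under 5 years",
               YEAR = 2000, RACE = "Other",
               SUB_POP_temp = c(100, 200, 50, 25))

# Sum the sub-populations within each group to get one row per combination
collapsed <- long %>%
  group_by(YEAR, AGE_GROUP, STATE, SEX, RACE) %>%
  summarize(SUB_POP = sum(SUB_POP_temp), .groups = "drop")
```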
Again, the dimensions look as we expect, with 60,588 rows. This time we have two levels of SEX, three levels of RACE, eleven levels of YEAR, eighteen levels of AGE_GROUP, and fifty-one levels of STATE. Multiplying these together gives 2 × 3 × 11 × 18 × 51 = 60,588. Looks good!
Now we will calculate the total population and percent of the total as we have done with the previous decades.
We can also check that our wrangling was performed correctly by summing the percentages for the individual sub-populations and seeing if they total 100 for each state and year.
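The percentage calculation and the check might together be sketched as follows (a sketch on a small made-up tibble; the hypothetical GROUP column stands in for the SEX, RACE, and AGE_GROUP subgroups, and near() is used to avoid floating-point surprises):

```r
library(dplyr)

dat <- tibble(YEAR = 2000, STATE = "Alabama",
              GROUP = c("a", "b", "c", "d"),
              SUB_POP = c(10, 20, 30, 40))

# Total population for each state-year, then each subgroup's share of it
dat <- dat %>%
  group_by(YEAR, STATE) %>%
  mutate(TOTAL = sum(SUB_POP),
         PERC_SUB_POP = (SUB_POP / TOTAL) * 100) %>%
  ungroup()

# Sanity check: within each state-year the percentages should sum to 100
check <- dat %>%
  group_by(YEAR, STATE) %>%
  summarize(poss_error = !near(sum(PERC_SUB_POP), 100), .groups = "drop") %>%
  count(poss_error)
```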
# A tibble: 1 x 2
poss_error n
<lgl> <int>
1 FALSE 561
Looks like the percentages for each state for each year all add up to 100, as we would expect. Great! Now we will reassign this processed data back to dem_00_10.
# A tibble: 60,588 x 6
YEAR AGE_GROUP STATE SEX RACE PERC_SUB_POP
<dbl> <fct> <chr> <fct> <fct> <dbl>
1 2000 Under 5 years Alabama Male White 2.24
2 2000 Under 5 years Alabama Male Black 1.05
3 2000 Under 5 years Alabama Male Other 0.101
4 2000 Under 5 years Alabama Female White 2.12
5 2000 Under 5 years Alabama Female Black 1.03
6 2000 Under 5 years Alabama Female Other 0.0995
7 2000 Under 5 years Alaska Male White 2.35
8 2000 Under 5 years Alaska Male Black 0.165
9 2000 Under 5 years Alaska Male Other 1.37
10 2000 Under 5 years Alaska Female White 2.26
# … with 60,578 more rows
OK, now we are ready to combine all of our demographic data together!